While this paper presents research data, its primary purpose is
pedagogical. Reliability is the Achilles heel of those clinical
disciplines employing the intuitive judgmental process as an operating
technique, and hence it is of tremendous interest to clinical psychologists.
Whenever a clinical study yields data which may bear upon the factor of
reliability, such reliability is eagerly surveyed and reported, and always
can count on a fascinated if not necessarily enthusiastic audience. For
our purposes here, reliability will be defined as inter-judge agreement
in a judgmental situation closely approaching actual operating clinical
practice.

Our data were obtained from a previously reported study in which 60
clinicians were given a group of 10 schizophrenic responses to items from
the Wechsler-Bellevue and Terman-Binet vocabulary tests (2). They were
then asked to rate each of the responses for the severity of the disorder
in the thinking processes exhibited, using an 11-point scale. The subjects
were 60 professional clinicians with four years or more of on-the-job
professional experience. As a measure of reliability, or inter-judge
agreement, we correlated the rank order of the 10 stimuli for each judge
with the rank order assigned by averaging the judgments of all 60
clinicians. While there is some contamination here, since each judge
contributed to the group average, the proportion of 1 to 60 renders this
negligible.
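
The per-judge measure described above can be sketched as follows. This is
a minimal illustration only: the function names and the small ratings
matrix are hypothetical, not the study's data, and ties are handled by
average ranks, which is an assumption (the paper does not say how ties on
the 11-point scale were treated).

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Product-moment correlation (undefined if either input is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Rank order correlation: Pearson r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def judge_vs_group(ratings):
    """ratings[j][s] is judge j's rating of stimulus s.

    Each judge is correlated with the mean rating of ALL judges, so the
    judge's own ratings are included in the average, as in the text.
    """
    n_stimuli = len(ratings[0])
    group_mean = [mean(judge[s] for judge in ratings) for s in range(n_stimuli)]
    return [spearman(judge, group_mean) for judge in ratings]


# Hypothetical data: 3 judges rating 5 stimuli.
ratings = [
    [1, 3, 5, 7, 9],
    [2, 3, 6, 8, 10],
    [9, 7, 5, 3, 1],
]
rs = judge_vs_group(ratings)
```

With only three judges the contamination is large; with 60 judges each
judge supplies one sixtieth of the average.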

Brevity and economy in reporting usually dictate the use of some single
measure of reliability, which in this case might well be the equivalent of
an average r. For our purposes, however, it seems wiser pedagogically to
present a complete table of all 60 rs. This appears in Table I. Inspection
immediately shows the wide range of rs, from +.02 to +.93, with a modal
clustering in the 60's. This might be viewed as representing a true value
in the 60's with an error distribution about this point, or it might be viewed
as a continuum of ability with individual clinicians distributed upon it.
Actually, it must be both, but it seems safe to assume that differences in
ability are at least in part responsible for the distribution, and that in
terms of the ability to make reliable judgments in the sense used here,
clinicians vary tremendously. There would seem to be "good" clinicians and
"bad" clinicians. While this fact is implicitly recognized among clinicians
and occasionally reported in the literature (5), it is seldom taken account
of either in experimental designs involving clinical judgment (4) or in
actual clinical practice utilizing such judgments. Certainly this wide range
of ability is concealed by the use of any single measure.

To illustrate this, let us use such a single measure. We select
Alexander's dr as a measure of the average r between pairs of judges (1).
When such an overall measure is applied to our data it comes out +.33.
It is an honest measure and statistically justified, but in this case it
conceals some very important information concerning the range of ability
among clinicians, a fact which is evident if we consult Table I.

So far we have been considering "reliability." Let us now consider
"chance." Since 60 clinicians is an unusually large sample for this type
of study, we may feel secure. Most studies use many fewer subjects. Suppose
we had had only 20 subjects in our group. We can attempt to answer this
conjecture by splitting our group of 60 randomly into three groups of 20
clinicians each. When we do this and apply the same measure, we find the
following average rs: +.19, +.51, +.26. The range here is noticeable.
In fact the r of +.51 is so far out of line as to establish the lack of
homogeneity in this sampling, although it was achieved in random fashion.
In terms of our total sample of 60 it is evident that a value of +.19 would
underestimate "typical" clinical ability and +.51 would overestimate it.
This demonstrates the part that chance (in sampling) may play in measuring
inter-judge agreement among clinicians.
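
The splitting procedure can be sketched as below. Several things here are
assumptions made for illustration: the judges are simulated (each adds
random error of a different size to a common severity ordering), the
single measure is a plain mean of pairwise rank correlations rather than
Alexander's formula, and the no-ties rank formula is used because the
simulated ratings are continuous. All names are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean


def ranks(values):
    """1-based ranks, assuming no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for position, index in enumerate(order, start=1):
        out[index] = position
    return out


def spearman_no_ties(x, y):
    """Rank order correlation, 1 - 6*sum(d^2) / (n*(n^2 - 1)); no ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def mean_pairwise_r(judges):
    """Average rank correlation over every pair of judges in the group."""
    return mean(spearman_no_ties(a, b) for a, b in combinations(judges, 2))


def split_into_groups(items, n_groups, seed):
    """Shuffle a copy of items and cut it into n_groups equal parts."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    size = len(shuffled) // n_groups
    return [shuffled[g * size:(g + 1) * size] for g in range(n_groups)]


# Simulated judges: a common severity ordering plus judge-specific noise.
rng = random.Random(0)
true_severity = list(range(1, 11))  # 10 stimuli
judges = []
for _ in range(60):
    noise = rng.uniform(1, 8)  # "good" judges have small noise
    judges.append([s + rng.gauss(0, noise) for s in true_severity])

overall_r = mean_pairwise_r(judges)
groups = split_into_groups(judges, n_groups=3, seed=1)
subgroup_rs = [mean_pairwise_r(group) for group in groups]
```

Changing the seed passed to `split_into_groups` gives a different set of
three subgroup values from the same 60 judges, which is the sampling
effect the paragraph describes.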

Now let us go one step further. Let us assume either that our data are
not amenable to treatment by the method of rank order correlation, or that
such a stodgy, commonplace measure seems too pedestrian for our use. Then
let us indulge in a little status-connected statistical and logical
interpretation. We may say that unless the clinicians were showing some
agreement between themselves the items judged would not be statistically
significantly distinguished one from the other. A measure of the
significance with which the items are discriminated would then be closely
related to inter-judge agreement and might serve as an indirect measure
thereof. This assumption is sensible, and supposedly the resulting measure
might be informative provided we kept firmly in mind how it had been
derived. For such a purpose we decided to use Hoyt's r, which gives a
measure of the reliability of the average judgments for the several items
(3). When applied to our data, it gives an r of +.97. Its use for our
purposes represents fantasy in inter-judge agreement among clinicians.
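
Why a reliability of the average judgment runs so much higher than the
agreement between pairs of judges can be seen in a short sketch. It
assumes the usual analysis-of-variance form of Hoyt's coefficient,
(MS items - MS residual) / MS items, and uses the Spearman-Brown formula
as a rough cross-check; neither is claimed to be the exact computation
performed in the study, and the function names are hypothetical.

```python
def hoyt_reliability(ratings):
    """Reliability of the mean rating per item, from a two-way ANOVA.

    ratings[i][j] is judge j's rating of item i (items are rows).
    """
    n_items = len(ratings)
    n_judges = len(ratings[0])
    grand = sum(map(sum, ratings)) / (n_items * n_judges)
    item_means = [sum(row) / n_judges for row in ratings]
    judge_means = [
        sum(row[j] for row in ratings) / n_items for j in range(n_judges)
    ]

    # Partition the total sum of squares into items, judges, and residual.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_items = n_judges * sum((m - grand) ** 2 for m in item_means)
    ss_judges = n_items * sum((m - grand) ** 2 for m in judge_means)
    ss_resid = ss_total - ss_items - ss_judges

    ms_items = ss_items / (n_items - 1)
    ms_resid = ss_resid / ((n_items - 1) * (n_judges - 1))
    return (ms_items - ms_resid) / ms_items


def spearman_brown(mean_r, n_judges):
    """Reliability of an average of n_judges, given the mean pairwise r."""
    return n_judges * mean_r / (1 + (n_judges - 1) * mean_r)


# Mean inter-judge r of +.33 stepped up to an average over 60 judges.
stepped_up = spearman_brown(0.33, 60)
```

With a mean inter-judge r of .33 and 60 judges, the stepped-up value is
about .97. The coefficient describes how stable the pooled average of 60
judgments is, not how well any two clinicians agree.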

Rank Order Correlations of Each Judge's Ratings
with Average Rating of 60 Judges